Red Wine Exploration by Darui Zhang
What properties make a good red wine? In this project we try to answer this question by exploring the red wine data set.
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
Univariate Plots Section
Feature Names and Summary
This red wine data set contains 1,599 observations with 11 variables describing the chemical properties of the wine.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Quality Distribution
The wine quality grade is a discrete number ranging from 3 to 8, with a median of 6.

Distribution of Other Chemical Properties

Univariate Analysis
Some observations on the distributions of the chemical properties can be made:
Normal: Volatile acidity, Density, pH
Positively Skewed: Fixed acidity, Citric acid, Free sulfur dioxide, Total sulfur dioxide, Sulphates, Alcohol
Long Tail: Residual sugar, Chlorides
Rescale Variable
The skewed and long-tailed data can be transformed toward a more normal distribution by taking the square root or the log. Taking sulphates as an example, we compare the original, square root and log of the feature.

Both the square root and the log function help transform the feature toward a normal distribution. Of the two, the log-scaled feature is closer to normal.
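The comparison can also be made numerically. The following is a minimal sketch, assuming a synthetic right-skewed sample as a stand-in for sulphates (it is not the wine data) and a hand-written `skewness` helper that is not part of the original analysis. A skewness closer to 0 indicates a more symmetric distribution.

```r
# Synthetic stand-in for a right-skewed feature such as sulphates
# (assumption: log-normal shape; these are NOT the real wine measurements).
set.seed(42)
x <- rlnorm(1599, meanlog = log(0.62), sdlog = 0.25)

# Hypothetical helper: sample skewness as the mean cubed z-score.
# It is close to 0 for a symmetric distribution.
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

# Compare the skewness of the raw, square-root and log-transformed values.
skews <- c(original = skewness(x),
           sqrt     = skewness(sqrt(x)),
           log      = skewness(log(x)))
print(round(skews, 3))
```

On this toy sample the log transform gives the skewness closest to 0, which matches the visual impression from the histograms.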
Bivariate Plots Section
Bivariate Plots Selection
A plot matrix was used to get a quick overview of the data. We are interested in the correlation between the wine quality and each chemical property.

The top 4 factors correlated with the wine quality (with an absolute correlation value greater than 0.2) are:
| Feature | Correlation with quality |
| --- | --- |
| alcohol | 0.476 |
| volatile.acidity | -0.391 |
| sulphates | 0.251 |
| citric.acid | 0.226 |
Bivariate Analysis
Alcohol content has the strongest correlation with the wine quality. The scatter plot of alcohol and wine quality is shown below.

The original plot is overplotted, so we add transparency (an alpha value) and lines for the 0.1, 0.5 and 0.9 quantiles to show the general trends.

In this plot the trend of increasing wine quality with increasing alcohol content can be clearly observed.
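The quantile lines can be backed by a small table of conditional quantiles. This is a sketch under stated assumptions: the data are synthetic, and the quantiles of alcohol are computed within each quality grade, which may not be the grouping used for the lines in the plot above.

```r
# Synthetic data (assumption): alcohol rises with the quality grade.
set.seed(7)
quality <- sample(3:8, 1000, replace = TRUE)
alcohol <- 8 + 0.4 * quality + rnorm(1000, sd = 0.8)

# 0.1, 0.5 and 0.9 quantiles of alcohol within each quality grade.
# The result is a matrix: rows are quantiles, columns are grades.
q <- sapply(split(alcohol, quality), quantile, probs = c(0.1, 0.5, 0.9))
print(round(q, 2))
```

Reading along the `50%` row shows how the median alcohol content shifts from one grade to the next.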
Distribution Analysis
In this analysis, we try to find out whether the distributions of the chemical properties differ between wine quality grades.

Note that since the number of observations differs between quality grades, the distributions of the highest and lowest grades are hard to see.
A normalized plot is shown below.

The plot looks a little busy. We group the grades in pairs: grades 3 and 4 as “Low”, grades 5 and 6 as “Medium”, grades 7 and 8 as “High”, and plot again.

The new plot looks cleaner.
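The grouping and the normalization can be sketched with base R as follows. The vectors are a handful of made-up values for illustration, and the alcohol bin edges are arbitrary assumptions.

```r
# Made-up example values (assumption), one entry per wine.
quality <- c(3, 4, 5, 5, 6, 6, 6, 7, 8)
alcohol <- c(9.0, 9.5, 9.4, 10.0, 10.2, 9.8, 11.0, 11.5, 12.5)

# cut() uses right-closed intervals by default: (2,4], (4,6], (6,8].
grade <- cut(quality, breaks = c(2, 4, 6, 8),
             labels = c("Low", "Medium", "High"))
print(table(grade))

# Normalize within each grade so that every row sums to 1. This makes
# small groups comparable with large ones.
bins   <- cut(alcohol, breaks = c(8, 10, 12, 14))
shares <- prop.table(table(grade, bins), margin = 1)
print(round(shares, 2))
```

With `margin = 1` the proportions are taken within each grade, which is the same idea as the normalized plot: group sizes no longer dominate the picture.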
A similar analysis was done for the 3 other factors: volatile acidity, sulphates and citric acid.


As noted in the univariate analysis, the sulphates data is skewed, so we tried both the original and the log scale of the feature.


The log scaled feature looks better.
Correlation Between Features
There is an interesting correlation between two of the main features: volatile acidity and citric acid.

##
## Pearson's product-moment correlation
##
## data: redwine$volatile.acidity and redwine$citric.acid
## t = -26.4891, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5856550 -0.5174902
## sample estimates:
## cor
## -0.5524957
Multivariate Plots Section
Main Chemical Property vs Wine Quality
Using color, we can add another dimension to the plot. There are 4 main features; alcohol and volatile acidity are the two most strongly correlated with wine quality.

The figure is overplotted because the wine quality takes discrete values. We use a jitter plot to alleviate this problem.

We can see that higher quality wines have higher alcohol and lower volatile acidity.
Add Another Feature
Now we add a third feature, the log scale of sulphates, and use a separate facet for each wine grade.

We can see that higher quality wines have higher alcohol (x-axis), lower volatile acidity (y-axis) and higher sulphates.
Main Chemical Properties vs Wine Quality
Since we can visualize only 3 dimensions at a time, including wine quality, two graphs are needed to show the 4 main chemical properties.

The same trend in the effect of alcohol and volatile acidity on wine quality can be observed.

We can see that higher quality wines have higher sulphates (x-axis) and higher citric acid (y-axis).
Linear Multivariable Model
A linear multivariable model was created to predict the wine quality from the chemical properties.
The features were added incrementally, roughly in order of the strength of each feature's correlation with wine quality.
##
## Calls:
## m1: lm(formula = quality ~ volatile.acidity, data = redwine)
## m2: lm(formula = quality ~ volatile.acidity + alcohol, data = redwine)
## m3: lm(formula = quality ~ volatile.acidity + alcohol + sulphates,
## data = redwine)
## m4: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid, data = redwine)
## m5: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides, data = redwine)
## m6: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides + total.sulfur.dioxide, data = redwine)
## m7: lm(formula = quality ~ volatile.acidity + alcohol + sulphates +
## citric.acid + chlorides + total.sulfur.dioxide + density,
## data = redwine)
##
## ==================================================================================================
## m1 m2 m3 m4 m5 m6 m7
## --------------------------------------------------------------------------------------------------
## (Intercept) 6.566*** 3.095*** 2.611*** 2.646*** 2.769*** 2.985*** -0.953
## (0.058) (0.184) (0.196) (0.201) (0.202) (0.206) (11.990)
## volatile.acidity -1.761*** -1.384*** -1.221*** -1.265*** -1.155*** -1.104*** -1.114***
## (0.104) (0.095) (0.097) (0.113) (0.115) (0.115) (0.120)
## alcohol 0.314*** 0.309*** 0.309*** 0.292*** 0.276*** 0.280***
## (0.016) (0.016) (0.016) (0.016) (0.017) (0.020)
## sulphates 0.679*** 0.696*** 0.871*** 0.908*** 0.903***
## (0.101) (0.103) (0.111) (0.111) (0.112)
## citric.acid -0.079 0.021 0.065 0.044
## (0.104) (0.106) (0.106) (0.124)
## chlorides -1.663*** -1.763*** -1.747***
## (0.405) (0.403) (0.406)
## total.sulfur.dioxide -0.002*** -0.002***
## (0.001) (0.001)
## density 3.923
## (11.944)
## --------------------------------------------------------------------------------------------------
## R-squared 0.153 0.317 0.336 0.336 0.343 0.352 0.352
## adj. R-squared 0.152 0.316 0.335 0.334 0.341 0.349 0.349
## sigma 0.744 0.668 0.659 0.659 0.656 0.651 0.652
## F 287.444 370.379 268.912 201.777 166.407 143.910 123.298
## p 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1794.312 -1621.814 -1599.384 -1599.093 -1590.662 -1580.192 -1580.138
## Deviance 883.198 711.796 692.105 691.852 684.595 675.689 675.643
## AIC 3594.624 3251.628 3208.768 3210.186 3195.324 3176.384 3178.276
## BIC 3610.756 3273.136 3235.654 3242.448 3232.964 3219.401 3226.670
## N 1599 1599 1599 1599 1599 1599 1599
## ==================================================================================================
The model with 6 features (m6) has the lowest AIC (Akaike information criterion) value, 3176.384. Adding a seventh feature, density, raises the AIC to 3178.276. The intercept also changes dramatically (from 2.985 to -0.953, with a standard error of 11.990), which is a sign of overfitting.
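The incremental model building and the AIC comparison can be sketched as follows. The data frame `toy` is synthetic and its `noise` column is a made-up feature that is unrelated to quality by construction, so this illustrates the workflow and does not reproduce the table above.

```r
# Synthetic data (assumption): two informative features and one pure
# noise feature.
set.seed(123)
n <- 1000
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.4, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.53, sd = 0.18),
  noise            = rnorm(n)
)
toy$quality <- 3 + 0.3 * toy$alcohol - 1.2 * toy$volatile.acidity +
  rnorm(n, sd = 0.65)

# Add one feature at a time, reusing the previous model's formula.
m1 <- lm(quality ~ volatile.acidity, data = toy)
m2 <- update(m1, . ~ . + alcohol)
m3 <- update(m2, . ~ . + noise)

# AIC rewards goodness of fit and penalizes each extra parameter,
# so lower is better.
aic <- AIC(m1, m2, m3)
print(aic)
best <- rownames(aic)[which.min(aic$AIC)]
print(best)
```

An uninformative feature always raises the log-likelihood a little, but usually not enough to pay for the extra parameter, which is the pattern seen when density is added in m7.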
The model can be described as:
wine_quality = 2.985 + 0.276 * alcohol - 1.104 * volatile.acidity + 0.908 * sulphates + 0.065 * citric.acid - 1.763 * chlorides - 0.002 * total.sulfur.dioxide
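As a worked example, the m6 coefficients from the regression table can be applied to a hypothetical wine whose six features all sit at the median values reported in the summary above. The coefficients are the rounded values printed in the table, so the result is approximate.

```r
# Coefficients of model m6, copied from the regression table (rounded).
coefs <- c(intercept            = 2.985,
           alcohol              = 0.276,
           volatile.acidity     = -1.104,
           sulphates            = 0.908,
           citric.acid          = 0.065,
           chlorides            = -1.763,
           total.sulfur.dioxide = -0.002)

# Hypothetical wine: every feature set to its median from the summary.
wine <- c(alcohol              = 10.20,
          volatile.acidity     = 0.52,
          sulphates            = 0.62,
          citric.acid          = 0.26,
          chlorides            = 0.079,
          total.sulfur.dioxide = 38)

# Intercept plus the sum of coefficient * feature value, matched by name.
predicted <- coefs[["intercept"]] + sum(coefs[names(wine)] * wine)
print(round(predicted, 2))  # about 5.59
```

The predicted grade of about 5.59 lies close to the mean quality of 5.636, as expected for a wine with typical feature values.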
Final Plots and Summary
Plot One

Description One
The median value of each chemical property at each wine quality grade is shown. The values are normalized by the maximum value so that they all range from 0 to 1. Features with monotonically increasing or decreasing trends, such as volatile acidity, have a higher correlation with the wine quality. Features that are flat or not monotonic, such as density and free sulfur dioxide, have less correlation with wine quality.
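The normalization described here can be sketched in a few lines. The data frame below holds made-up values for two features and three grades, purely for illustration.

```r
# Made-up example values (assumption), two wines per quality grade.
toy <- data.frame(
  quality          = c(3, 3, 4, 4, 5, 5),
  volatile.acidity = c(0.90, 0.80, 0.70, 0.60, 0.50, 0.60),
  density          = c(0.997, 0.996, 0.998, 0.996, 0.997, 0.997)
)

# Median of every feature within each quality grade.
med <- aggregate(. ~ quality, data = toy, FUN = median)

# Divide each feature by its largest median so that all values are at
# most 1 and the features share one scale.
features <- setdiff(names(med), "quality")
med[features] <- lapply(med[features], function(col) col / max(col))
print(med)
```

In this toy example the normalized volatile acidity falls steadily with quality while density stays nearly flat, the same contrast the plot is meant to highlight.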
Plot Two

Description Two
The 4 features that have the highest correlation coefficients are alcohol (0.476), volatile acidity (-0.391), sulphates (0.251) and citric acid (0.226). The wine quality grades are grouped into low (3,4), medium (5,6) and high (7,8). High quality wines have a high alcohol level; however, there is no significant difference between medium and low quality wines. Volatile acidity decreases as wine quality increases. Sulphates and citric acid increase as wine quality increases.
Plot Three

Description Three
The 4 features are also shown in scatter plots. Two features are plotted at a time, with color indicating wine quality. Trends similar to those in the previous figure can be observed.
Reflection
The red wine data set contains 1,599 observations with 11 variables on the chemical properties. We are interested in the correlation between the features and wine quality. Unlike diamond price, which is dominated by size (carat), wine quality is more complex and has no obvious single driver. Most of the data visualization in this project was done on the 4 features that have the highest correlation coefficients: alcohol (0.476), volatile acidity (-0.391), sulphates (0.251) and citric acid (0.226). After some web research, our reflections on these chemical components are as follows.
Alcohol: surprisingly or not, alcohol is the No. 1 factor correlated with wine quality. The data strongly suggest that the higher the alcohol content, the more likely the wine is to be of better quality. One suggestion is that wines with higher alcohol are made from riper grapes, which tend to have more intense flavor. Therefore, the relation between alcohol and wine quality is more likely to be correlation than causation. There is also controversy about alcohol level. One article even says “high alcohol is a wine fault not a badge of honor”. [1][2]
Volatile acidity: volatile acidity has a negative correlation with wine quality. Volatile acidity can contribute to acidic tastes, which are often considered a wine fault. [3]
Sulphates: sulphates have a positive correlation with wine quality. They are often added by winemakers to prevent spoilage. It is less likely that sulphates themselves contribute to better taste or aroma. Their presence simply means the wine is less likely to be spoiled. [4]
Citric acid: unlike volatile acidity, citric acid has a positive correlation with wine quality. Winemakers often add citric acid to give a “freshness” taste. However, it can also bring unwanted effects through bacterial metabolism. [5]
Surprisingly, other chemical properties, such as residual sugar and pH, do not have a strong correlation with wine quality.
In the end, a linear model of 6 features was created to predict wine quality. However, wine quality is a complex subject. Different types of grape can largely affect the taste of the wine. There are many nuances in taste and aroma that cannot be captured by examining the chemical components. The linear model is an overly simplified model. Good wine is more than a perfect combination of different chemical components.